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1. Introduction 



Abstract. We are developing a Bayesian approach based on Markov chain 
Monte Carlo techniques to search for and extract information about white dwarf 
binary systems with the Laser Interferometer Space Antenna (LISA). Here we 
O present results obtained by applying an initial implementation of this method to 

O^ some of the data sets released in Round IB of the Mock LISA Data Challenges. 

,' , ■ For Challenges IB. 1.1a and lb the signals were recovered with parameters lying 

KfV within the 95.5% posterior probability interval and the correlation between the 

i ' true and recovered waveform is in excess of 99%. Results were not submitted for 

Challenge IB. 1.1c due to some convergence problems of the algorithm; despite 
f\J , this, the signal was detected in a search over a 2 mHz band. 

> 

(N 

O 

o. 

Galactic white dwarf (WD) binary systems are guaranteed sources for the future 
Laser Interferometer Space Antenna, LISA [Tj. Despite the simple nature of the 
expected gravitational radiation - a quasi monochromatic signal - the data analysis 
task becomes challenging due to the tens of millions of such sources in the Galaxy that 
rS ' radiate in the instrument's observational window, with signals strongly overlapping 

in time and frequency space at the LISA output [H [SJ 2] . As a consequence, LISA 
is expected to resolve about 10 4 galactic binaries with the remaining producing a 
"confusion noise" , whose exact magnitude will depend on the resolving power of the 
search methods (as well as the actual astrophysical population in the Galaxy). 

Due to the large number of sources to be analysed at the same time and the 
fact that this number is unknown, Markov chain Monte Carlo (MCMC) techniques 
El [3 [9] and their extension to Reversible Jump Markov chain Monte Carlos 
(RJMCMCs) [TU1 ITTJ [T2] are expected to be one of the most powerful search methods. 
There are several instances in which techniques based on Bayesian inference have 
been successfully implemented in the context of LISA data analysis, not only in the 
search for WD binary systems, but also gravitational radiation from massive-black- 
hole binary inspirals [131 HH H3 [HI [T71 [TB] and extreme-mass ratio inspirals [HI dD] • 

In the summer of 2007, the Mock LISA Data Challenge (MLDC) Task Force 
[21] [22] released a re-issue of the Round 1 challenges, called Challenge IB [23] . In 
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this paper, we present results obtained by applying an initial implementation of a 
Bayesian approach based on a MCMC method to the single galactic binary data sets, 
Challenge lB.l.la-c [23] ■ Several other groups have tackled this problem using a 
number of approaches: Cornish & Crowder [5] [5] El IE] developed algorithms based 
on variations of Monte Carlo Metropolis-Hastings Samplers and successfully applied 
them to searches for overlapping sources; MCMC methods for single-source analysis 
have been explored by several groups, e.g. [9]; Prix & Whclan 25, 26] and Krolak and 
collaborators have developed a matched-filtering approach, based on the computation 
of the JF-statistic [27]. A summary of the results obtained in Challenge IB is provided 
by the MLDC Task Force [23] . 

All the results that we present here, correspond to the blind challenge data sets 
and were obtained before the release of the key files, in December 2007. 

2. Analysis method 

Following a Bayesian approach, we can infer the probability density functions (PDFs) 
of the vector of the unknown model parameters A, given a data set d and some prior 
information W, using Bayes' theorem: 

,r,, wl p(X\W) p(d\X,W) 

p(A|W = — pjdW) — • () 

Here, p(X\d 1 W) is the posterior probability density function of the parameters given 
the observed data (i.e. it is what we arc interested in), p(A|W) is the prior knowledge 
we have about the different parameters, p(d\X, W) is the likelihood function of the 
data given the model and finally, p(d\W) is a normalisation factor independent of the 
unknown model parameters, and therefore irrelevant to this MCMC analysis. 

The gravitational wave (GW) signal emitted by a WD binary system with 
constant orbital frequency is characterised by 7 independent parameters: the frequency 
of the signal /, its amplitude A, two angles to define the sky location of the source 
- here longitude <j> and latitude i - the GW polarisation ip, the inclination angle 
between the angular momentum of the system and an unitary vector parallel to the 
line of sight, l and finally, a constant value to fix the initial phase of the signal ip$. In 
our analysis we also treat the noise level that affects the measurements as unknown; we 
parametrise it with a 2 the (constant) variance of the noise contribution in frequency 
domain in the small band where the signal lies. As a consequence, the analysis of the 
data containing a single galactic binary requires the estimation of an 8-dimensional 
parameter vector A. 

MCMC methods are well suited to compute the joint posterior PDF p(X\d,W), 
and the marginalised posterior PDF for any given parameter (or subset of parameters), 
say Ai: 



p(Ai|d, W) = J dX 2 ...J dX s p(X\d, W) . (2) 

From the PDF above, one can then compute the posterior mean as 

/•OO 

A J= / d\i\ iP {\i\d,W). (3) 



We have implemented a Metropolis-Hastings MCMC algorithm which has the 
property that after an initial "burn-in" period (which is discarded in the generation 
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of the PDFs), it returns samples of A with a probability density equal to the desired 
posterior p(X \ d,W) [28] . 

In an MCMC algorithm, a sequence of points ('a chain') is constructed. The first 
point is chosen randomly according to a uniform prior distribution, then subsequent 
elements of the chain are generated in the following way: 

(i) Given the current member of the chain corresponding to the parameter vector A, 
the new member A' is proposed according to 

X'^Xi + uAi (t = l,...,8), (4) 

where u is a Gaussian random number with zero-mean and unit variance, and Aj 
sets the size of the step (see the next Section for more details) . 

(ii) The new member is accepted with a probability computed according to the 
Metropolis-Hastings ratio: 

a X , X ,=min(l,^j, (5) 

where 7r(A) oc p(A|VF) p(d\X, W) is the so-called target distribution. 

Since the signal from a galactic binary is nearly monochromatic at the LISA 
output, it is advantageous to work in the Fourier domain, and to consider only the 
small frequency band where the signal power is concentrated. In our implementation 



we analyse only the band B w = 1.5 



2(5 + 47r/of sin0)/ m + |/ o |T o . 



bs 



around /o, 



where f m — 1 yr^ 1 , /o is the signal frequency at a given reference time, /q its time 
derivative (for the signal model adopted in MLDC-1B fo = 0), and 1.5 is a safety 
factor. We therefore FFT the time series X, Y and Z of the unequal arm pseudo- 
Michelson outputs in which the MLDC data are distributed [53], and construct the 
two noise-orthogonal Time Domain Interferometry (TDI) outputs [55] : 

, 2X-Y-Z Z -Y 

A and E for a given choice of the source parameters are then computed directly in 
the Fourier domain using an approximation and software implementation provided 
by Cornish and Littenberg [TJJ in which the LISA response function for nearly 
monochromatic signals is separated in a fast and a slow part. The fast part is 
computed analytically in the frequency domain, while the slow one can be evaluated 
efficiently by sampling it at a much slower rate in the time domain and then doing 
an FFT. It takes approximately 10~ 2 s, depending on the frequency bandwidth, to 
generate a single signal in the relevant range, and in order to avoid spurious oscillations 
at the edges of the frequency band due to windowing, we generate the signal in a band 
3/2 times wider than what is needed, B Wl and then select only the central part. 

The likelihood function that needs to be computed at each step of the MCMC 
Metropolis-Hastings algorithm is given by 



jV 



2 



P (d\X,W)cc- w expl-—Y,/Z\^,k-h a ,kW[\ > (?) 

I Q=l fc=l J 

where a labels the TDI channel - A and E -, k = 1, . . . , N the frequency bin, d the 
data set and h(X) the predicted signal (model) in the Fourier domain. 
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Table 1. Information about the Mock LISA Data Challenges 1B.1.1X, 
corresponding to a single galactic binary signal. Brackets represent the prior range 
of the two parameters for the different sources. The data sets are approximately 
1 yr long with a cadence of 15 seconds. 



Challenge SNR frequency (mHz) 

lB.l.la [10 , 20] [0.9 , 1.1] 

lB.l.lb [10 , 20] [2.9 , 3.1] 

lB.l.lc [10 , 20] [9 , 11] 



3. Results 

As a test of the initial implementation of the analysis algorithm, we analysed the three 
data sets of Challenge IB. 1.1, each of which contained a single galactic binary signal 
buried in Gaussian and stationary instrument noise. In Table [T] we summarise the 
main properties of the three challenge data sets; the main difference amongst them is 
the actual value and prior range of the signal frequency. In particular, in Challenge 
IB. 1.1c the frequency prior region was 10 times wider than for Challenge lB.l.la-b. 

The algorithm was exactly the same for each analysis and we adopted constant 
values for the amplitude of the proposals Aj throughout the evolution of the chains. 
For the angles, the value of Aj was set to one third of the prior range; for the frequency 
we used A/ = 3 yr -1 = 95 nHz, corresponding to three frequency bins; we evolved the 
noise level in a logarithmic scale with A\ oglJ = 0.01, and for the signal's amplitude we 
used A a = 3 x 10 -25 . These choices were motivated by reasons of efficiency, however 
for lack of time no specific tuning of the MCMC code took place. 

Some additional care is required to probe efficiently the prior range of several 
source parameters. The signal model adopted for galactic binaries is characterised by 
5 angular parameters, each of which with an intrinsic periodicity: initial phase (fo 
and longitude <j> are modulo 27r; latitude £, polarisation angle ip and orientation i arc 
modulo 7r. We have limited our parameter space in order to work always with angular 
values between and their period. The co-latitude 9, that we use in our search instead 
of latitude, is taken between and %, giving a latitude range in the usual interval 
t G (— 7r/2, 7r/2). There exists another symmetry concerning the angular parameters, 
and it is given by the fact that the transformations tpo — > tpo ± 7r or ip — > ip ± 7r/2 
produce a change of the waveform's global sign. This implies that we can use e.g. 
tpo G (0, 2n) as before, but restrict the polarisation angle range to ip G (0, 7r/2), since 
the (tt/2, 7t) range can be covered by adding or subtracting 7r to the initial phase. 

The most crucial parameter for the search for galactic binaries is the frequency, 
and it is advantageous to analyse any given data set in a number of narrow frequency 
bands. For Challenge lB.l.la-b, the signals were so strong that their frequency 
could be approximately determined by simply computing the power spectrum of the 
data; the MCMC algorithm was then run on a narrow band around that frequency 
(see Section [3~Tj) . Challenge IB. 1.1c was more challenging for us because the prior 
frequency range was 10 times larger and, the signal being at higher frequency, the 
Doppler effect spreads the power over many frequency bins without producing clearly 
identifiable peaks in the spectrum (see Section I3.2() . The results of the analyses are 
discussed in some detail in the next two subsections. 
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Challenge IB. 1.1a Challenge IB. 1.1b 



2.5 






1 








2 




A 


0.8 




A 




~ 








,r- 








N 1 c 

I 1.5 










I 0.6 








< 




I Jill 














s? 0.4 

< 












0.5 












lilWJ 




0.2 












0.9 


1 1.1 


2.9 2.95 3 3.05 


3.1 






Frequency (Hz) 






X' 


O" 3 








Frequency (Hz) 


x10" 3 



Figure 1. Power spectral density of LISA'S channel A for Challenge IB. 1.1a and 
lb. The peak produced by the actual signal is highlighted by an ellipse. 
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Figure 2. Marginalised posterior PDFs of each of the seven parameters that 
characterise the signal, and noise level for Challenge IB. 1.1a data set. The 
true values of the parameters are represented by a vertical solid line, while our 
submitted values (mean value of the distribution) together with their 95.5% 
probability interval |30| are represented by the dashed and dash-dot lines, 
respectively. Dotted lines indicate positions in the parameter space equivalent 
to the true ones. 



3.1. Challenge lB.l.la-b 

In Figure Q] we plot the power spectral density of the A TDI output from Challenge 
lB.l.la-b data sets in the relevant frequency range. It was easy to identify "by 
eye" the approximate frequency of the signal by simply looking at the highest peaks 
and checking for their presence in the noise orthogonal TDI channel, E. Using this 
procedure, we identified a narrow frequency range (50 frequency bins, corresponding 
to 1.58 x 1CP 6 Hz) for the analysis with the MCMC code to generate the posterior 
PDFs of the signal parameters. The typical length of the burn-in stage was chosen 
to be 2.5 x 10 4 steps, and the code run until 10 5 iterations of the Markov chain were 
completed, which in all cases took less than two hours to run on a 2.16 GHz processor. 
The results submitted for the challenge were obtained by combining the post-burn-in 
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Figure 3. The same plots as in Figure |21 but for Challenge IB. 1.1b data set. 



stage of five different runs, each of which used different initial data and seeds for the 
random number generator. 

In Figures [5] and [31 and Table [51 we summarise the results of the analysis. In 
order to assess the performance of our approach we compare our results with the 
parameters that describe the actual signal present in the data set - the "key" file - 
and also the outcome of the analysis from other participating groups. Our analysis 
pipeline performed in general in a satisfactory way: the true parameters of the signals 
are all within the 95.5% posterior probability interval, and the difference between the 
true and recovered mean value of each parameter is comparable with that obtained 
by other groups. Notice that the accuracy of the measurements varies considerably 
for the seven parameters of the signal: the two sky location angles can be estimated 
with an error that is typically of the order of 0.1 rad, but the error on the inclination 
angle is around 0.5 rad, and for the polarisation angle and initial phase it increases to 
~ 0.7 rad and ~ 1.5 rad, respectively. The signal frequency can be estimated with very 
high accuracy, 3.5 nHz, which represents around one tenth of a frequency resolution 
bin of width 32 nHz. Finally, the logarithm of the noise level estimation provided by 
the MCMC code has a typical error of 0.08, whereas the relative error of the signal's 
amplitude estimation is typically 35%. 

Additional quality indicators of the results of a given analysis that are used the 
context of the MLDCs, are the correlation C between the true waveform h^ oy and the 
recovered one h roc , and the recovered signal-to-noise ratio SNR roc with respect to the 
optimal one SNR]< ey [55]. Our results yield C that differs from 1 by less than 1%, 
and SNR rGC = 13.577 (SNR kcy = 13.819) for Challenge IB. 1.1a and SNR rec = 23.479 
(SNRkcy = 24.629) for Challenge IB. 1.1b. Both figures of merit are satisfactory and 
comparable with those obtained by other participating groups, sec Table O and Tabic 
1 of [23]. 

In summary, the outcome of the analysis on the Challenge lB.l.la-b data sets 
provides a successful validation of our analysis pipeline for the simplest possible 
scenario, and suggests that the inner core of our approach is sound and can be 
regarded as a solid starting point for more complex searches. However, a realistic 
analysis of LISA data needs to deal with tens of thousands of white dwarf binary 
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Table 2. Summary of the results for Challenge lB.l.la-b. We show the mean 
values and the 95.5% probability interval of the recovered parameters. We 
also present the true values from key files, the difference between them and 
our submitted results [AA (ours)] and the median of the results submitted by 
other participating groups |23| [AA (typ)]. We also show the optimal SNR 
(from the key file), the recovered SNR (values of SNR in A A columns represent 
SNR kcy — SNR roc |/SNR kcy ), as well as the correlation between the true waveform 
and the recovered one, C. 
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systems, distributed over a wide (~ tens of mHz) frequency window and for a range 
of signal-to- noise ratios. 

Due to the nature of the A parameter, we should expect the distribution of log A 
to be approximately symmetrical, whereas we have sampled and shown A itself, which 
is asymmetrical. The cos(t) parameter is correlated with A, and so its posterior 
distribution is also asymmetrical. 

At the time of the deadline for Challenge IB our analysis approach was still 
immature and the first natural extension of the code - i.e. the ability to search 
automatically over a wider frequency band than that required for Challenge IB. 1.1a- 
b - was still under development. In fact, results were not submitted for Challenge 
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(a) training IB. 1.1c 
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Figure 4. Looking for the frequency value in Challenge IB. 1.1c, we used the 
final ratio between the number of accepted transitions in /, over the number of 
proposed, as a signal presence indicator, expecting to find a minimum. Here we 
plot this quantity for the 1000 different frequency chunks we divided the data into, 
for (a) the training data set and (b) the challenge data set. For this latter case, 
we highlight with an ellipse the absolute minimum of this quantity, expecting to 
find a signal in that frequency region. 



lB.l.lc. For completeness, in the next Section we shortly summarise the problems 
that were encountered and justify the lack of an entry for Challenge IB. 1.1c. This 
also provides a natural justification for the on-going development work. 



3.2. Challenge IB. 1.1c 

The efficiency of an MCMC search method for galactic binaries depends crucially on 
the width of the prior frequency range of the signal. For Challenge lB.l.la-b we were 
able to further restrict the prior range following the ad hoc strategy described in the 
previous Section. This was not possible for Challenge IB. 1.1c, as no clear peak(s) 
could be identified in the raw power spectrum of the data. This is due to several 
(simple) reasons: the wider (by a factor 10) prior frequency interval and the higher 
frequency of the signal, that produces a larger absolute Doppler frequency shift and 
therefore spreads the signal power over many frequency bins. 

In principle one could just let the MCMC code run over a 2 mHz band and 
eventually the algorithm should converge; however this can take an unreasonably long 
time and does not provide any real advance in developing an algorithm for more 
complex and realistic situations. The interim solution that we adopted was to divide 
the whole frequency band into 1000 overlapping intervals of 63 frequency bins per 
interval, and we run the MCMC code in each of them. As an indicator for the presence 
of a signal, we used the final ratio between the number of accepted and proposed 
transitions in /, expecting to find a minimum - the Markov chain has converged and 
the transition acceptance probability diminishes - when a galactic binary signal is 
indeed present. 

We tested this procedure on the training data set and we found a minimum in the 
frequency chunk that contained the signal, see Figure [4^. When running the analysis 
on the challenge data set, the results contained some additional complication, and 
this is reported in Figures SJd and We could correctly establish that the signal 
was confined to the region spanned by two frequency chunks (corresponding to the 
number 471 and 472, see Figure|4j)) covering the band 9.942 mHz— 9.946 mHz, which is 
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Figure 5. Marginalised PDFs of frequency and sky location (here, 9 represents 
the co-latitude) parameters, after running our MCMC code on the Challenge 
IB. 1.1c data set around the frequency value identified from Figure |4p. Cases 
(a), (b) and (c) correspond to independent results obtained just by changing the 
starting values of the chain and the initialisation of the random number generator. 



consistent with the true value of the key file, /true = 9.94347 mHz. However, running 
three independent chains with arbitrary starting values and different initialisation 
seeds produced posterior PDFs which provided ambiguous results. In Figure [5] we 
plot the posterior PDFs of three parameters (frequency and sky location), for the 
three chains. We see that for case (a) the MCMC converged on the correct point 
in the parameter space, but for (b) the chain converged onto a secondary maximum 
of the likelihood function, and in (c) it did not even converge. This is the effect 
of a known property of the likelihood function produced by galactic binary signals 
that is characterised by multiple peaks at a frequency separation w lyr^ 1 [5] that 
our code was not yet sufficiently mature to deal with. Some ad-hoc tests could be 
used in order to discriminate the chains that converged to the actual place in the 
parameter space from the other ones, like comparing the recovered SNR or looking at 
the residuals when the recovered source was subtracted, but we are focussing on the 
future applications of the algorithm where it will be necessary to have all the chains 
sampling properly the posterior PDFs in a reasonable number of steps. Thus, work is 
currently on-going to address this issue, while retaining the Markovian properties of 
the chains. It should be noted that the problem of multimodal posterior distributions 
is more general, and also present in other sources considered so far in the MLDCs, for 
example massive-black-hole binary inspirals and extreme mass-ratio inspirals. 
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4. Conclusions and future work 

Wc have begun the development of an end-to-end analysis algorithm to identify and 
study stellar-mass galactic binary systems in a Bayesian framework. The core of the 
analysis is based on an application of MCMC techniques. Using a first implementation 
of some of the building blocks of this pipeline we have successfully analysed the 
simplest single-source Challenge lB.l.la-b data sets: the true parameters of the 
signals all lay within the 95.5% posterior probability interval, the combined A and 
E recovered signal-to-noisc ratio exceeds 95% of the optimal one and the correlation 
between the recovered and the true waveform is above 99%. This performance is 
competitive with that achieved by other groups who analysed the same data sets. 
Due to the immaturity of the algorithm in dealing with the multiply-peaked structure 
of the likelihood function for galactic binary signals at higher frequencies, results were 
not submitted for the Challenge IB. 1.1c data set. 

Work is currently on-going to address the limitations of the present MCMC 
implementation. In particular, an extension to this algorithm using a Delayed 
Rejection MCMC technique has been developed and is undergoing testing and 
validation. A description of this extended technique is in preparation. Our long- 
term goal is to implement a multi-source version using RJMCMC in order to perform 
model-selection on the number of sources. 
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